Abstract

Much speculation abounds concerning how expensive performance assessments are or are 
going to be. Recent projections indicate that, in order to achieve an acceptably high generalizability 
coefficient, many additional tasks may need to be added, which will increase costs. Such 
projections are partly correct and partly simplistic. The current investigation 
uses two synthetic examples, based on published costs and variance components, and a 
constrained optimization procedure to examine the complex relationships among reliability, cost, 
and sample size. The results indicate that the optimal design changes as the number of subjects 
changes. Another set of results confirms what seems to be intuitively expected: as the number of 
subjects grows, the relatively fixed development cost becomes a smaller and smaller percentage of 
the total cost. These two sets of results seem to be directly related. Since, for the smaller samples, 
development costs constitute the majority of total cost, the optimal design includes more raters than 
prompts. That is, the burden of reliability is shifted to the least expensive (in relative terms) part of 
the assessment. 

Optimal Designs for Performance Assessments: The Subject Factor 

Two of the more immediate perceived roadblocks to the implementation of performance 
assessments are low reliability and high cost. Current understanding of how the two are related, 
however, may suffer from two problems. The first is that the two are rarely 
addressed simultaneously in an empirical fashion. The second, which may actually be a 
cause of the first, is the assumption that the two are linked in a Spearman-Brown style 
relationship. The idea that the relationship is much more complex than that was raised by Sanders, 
Theunissen and Baas (1989). This investigation attempts to add to the understanding of this 
complex relationship. 



Speculation about cost 

Many researchers in the last several years have found that performance assessments 
produce low generalizability coefficients (e.g., Shavelson and Baxter, 1992; Shavelson, Baxter 
and Gao, 1993; Koretz, Klein, McCaffrey, and Stecher, 1994; Koretz, Stecher, Klein, 
McCaffrey, and Deibert, 1994; Koretz, Stecher, Klein, and McCaffrey, 1994; McWilliam and 
Ware, 1994). Furthermore, they have noticed that these low coefficients are due not so much to rater 
variance, which was the scourge of reliability in scoring from the 1960s to the 1980s (cf. Huot, 
1990), as to task variance or task-by-subject variance. Since the g-coefficients were low, some 
of the researchers (e.g., Shavelson, Baxter and Gao, 1993; McWilliam and Ware, 1994) projected 
the number of tasks necessary to achieve acceptable (e.g., > 0.80) g-coefficients. These 
projections were large (as many as 23 science tasks in Shavelson et al., 1993), which then led to 
the inference that such assessments would be very expensive. 

Other researchers have also estimated a large cost. In discussing large-scale portfolio 
assessment, Reckase (1995) concluded that, compared to current multiple-choice methods, 
portfolios would be a "very expensive alternative" (p. 14). White (1994) held the opinion that, 
although the cost would come in different places (i.e., scoring instead of development), the overall 
costs would be comparable. Hoover and Bray (1995) to some extent validated this claim by 
showing that the Iowa Writing Test could be conducted for approximately the same cost as the 
Iowa Test of Basic Skills, although the former covered a much smaller domain than the latter. 

Until recently, however, the two problems of low reliability and high cost have been 
discussed together theoretically but not joined together empirically. When they are joined, the 
relationship proves much more complex than it first appears. The assumption that adding more tasks 
will make the assessment both more reliable and more costly is questionable on three counts: 
first, it assumes that the relationship between number of tasks and reliability is the 
same as that between number of items and reliability as expressed in the Spearman-Brown 
Prophecy Formula; second, it is not grounded in an empirical technique which takes both concerns 
into account simultaneously; and third, it seems to ignore the sample-dependent nature of reliability 
and cost. In contrast, Sanders, Theunissen and Baas (1989) claim that it is actually possible to 
decrease cost while increasing reliability. They also provided a procedure for optimizing an 
assessment design, that is, minimizing cost while holding the g-coefficient at or above a given 
level. Building on that work, Parkes and Suen (1995), using the constrained optimization 
algorithm of Sanders et al. (1989, 1991, 1992), showed that for any given assessment situation, 
there are many optimal designs. It is for the designer on site to say which would be optimal given 
the constraints reasonable to that situation. 

The current investigation adds another piece to the understanding of the complex nature of 
the relationship between cost and reliability. First, the results here indicate that the number of 
subjects is a situational variable which will alter the optimal design of the assessment. Second, as 
the number of subjects changes, so do the proportional relationships between development costs, 
scoring costs, material costs, and the total cost of the assessment. 

The Synthetic Assessment Situations 

Three different investigations form the basis for the current analyses. The goal was to 
approximate reliability and cost data based on published studies for two different assessment 
situations. The first is a limited writing sample situation and the second is a large-scale portfolio 
assessment. 

The cost data for both situations are taken from Hoover and Bray (1995), who report on 
cost information for an administration of the Iowa Writing Assessment. The assessment tested the 
writing skills of 30,000 school students from grades three to twelve, each of whom wrote two 
pieces of writing. Each sample was scored twice holistically and twice analytically. For this 
assessment, Hoover and Bray estimate that $138,000 was spent in developing the 40 writing 
prompts; $174,410 was spent to score the prompts; and $30,000 was spent for materials. 

In the optimization procedure that is to follow, it is necessary to have an estimate of how 
much cost each aspect of the situation (rater, subject, prompt) contributes to the total cost. In order 
to achieve this, base units of development, scoring, and material costs were calculated and then a 
total cost function constructed. For example, the development cost is dependent on both the 
number of prompts developed and the number of prompts each subject completes. To obtain a base 
unit cost for development, the $138,000 development cost was divided by 40 prompts to obtain a 
development cost of $3450 per prompt, and that was divided by two since each person wrote two. 
This produces the estimate of $1725 for each prompt that each person has to write. Therefore, the 
development cost function is 1725 n_p, where n_p is the number of prompts each person must write. 
The scoring cost ($174,410) was divided by the number of subjects (30,000), the number of 
prompts per subject (2), and the number of raters or readings per piece (2) to produce a unit 
scoring cost of $1.43 per prompt, per rater, per subject. The materials were estimated to cost 
$1.00 per subject. Therefore, the total cost function is: 



\[
\text{Total Cost} = \$1725\,n_p + \$1.43\,n_p n_r n_s + \$1.00\,n_s \qquad (1)
\]
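
The unit-cost derivation above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' worksheet: the constant and function names are assumptions introduced here. The scoring coefficient is kept at the $1.43 reported in the text, although dividing $174,410 by 30,000 x 2 x 2 directly gives roughly $1.45.

```python
# Unit costs built from the Hoover and Bray (1995) figures quoted in the text.
# All names below are hypothetical, introduced only for this illustration.
DEV_COST_PER_PROMPT = 138_000 / 40 / 2        # $1725 per prompt each subject writes
SCORING_UNIT_COST = 1.43                      # $ per prompt, per rater, per subject (as reported)
MATERIAL_COST_PER_SUBJECT = 30_000 / 30_000   # $1.00 per subject


def total_cost(n_p: int, n_r: int, n_s: int) -> float:
    """Equation 1: development + scoring + materials, in dollars."""
    development = DEV_COST_PER_PROMPT * n_p
    scoring = SCORING_UNIT_COST * n_p * n_r * n_s
    materials = MATERIAL_COST_PER_SUBJECT * n_s
    return development + scoring + materials


# Cost of the original design: 2 prompts, 2 readings, 30,000 subjects.
print(round(total_cost(n_p=2, n_r=2, n_s=30_000), 2))
```

Only the scoring term depends on all three design variables, so it is the term that grows fastest as the number of subjects increases.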

Parkes and Suen (1995) produced variance components for fifty subjects writing four 
prompts which were read by three raters. These variance components, which constitute the first 
situation, are given in Table 1. 

Data from the Vermont Portfolio Project as published in Koretz, Stecher, Klein, McCaffrey 
& Deibert (1994) were used to create the second situation. The variance components from the 
Grade 4 writing portfolios were used and are given in Table 2. These components are based on 
portfolios consisting of two parts read by two raters from a total of 1,714 subjects. 

INSERT TABLES 1 AND 2 HERE 



The First Situation 



The first synthetic example combines the Parkes and Suen (1995) variance components 
with the Hoover and Bray (1995) cost estimates. 

The Variance Model 

In the Parkes and Suen (1995) variance model, two facets are fully crossed: writing prompt 
(p) and rater (r). The object of measurement is the student's overall writing ability (s). Thus, in the 
generalizability framework, the variance model is: 



\[
\sigma^2(X_{psr}) = \sigma^2_s + \sigma^2_p + \sigma^2_r + \sigma^2_{sp} + \sigma^2_{sr} + \sigma^2_{pr} + \sigma^2_{psr} \qquad (2)
\]



For the optimization analyses, the relative model of measurement was used. Thus, relative 
error variances were estimated through: 



\[
\sigma^2(\delta) = \frac{\sigma^2_{sp}}{n_p} + \frac{\sigma^2_{sr}}{n_r} + \frac{\sigma^2_{psr}}{n_p n_r} \qquad (3)
\]







where n_r and n_p are the numbers of raters and prompts in each particular scenario, respectively. The 
G-coefficient of interest was thus: 



\[
E\rho^2 = \frac{\sigma^2_s}{\sigma^2_s + \sigma^2(\delta)} \qquad (4)
\]
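
As a minimal sketch, the relative error variance and G-coefficient can be computed as below, assuming the standard relative-error expression for a fully crossed subject x prompt x rater design (subject-by-prompt variance divided by n_p, subject-by-rater variance divided by n_r, and the residual divided by n_p n_r). The function names are assumptions, and the variance components are hypothetical values chosen only for illustration; they are not the Table 1 estimates.

```python
def relative_error_variance(var_sp, var_sr, var_psr, n_p, n_r):
    """Relative error variance for a crossed s x p x r design (Equation 3)."""
    return var_sp / n_p + var_sr / n_r + var_psr / (n_p * n_r)


def g_coefficient(var_s, var_sp, var_sr, var_psr, n_p, n_r):
    """Universe-score variance over itself plus relative error (Equation 4)."""
    error = relative_error_variance(var_sp, var_sr, var_psr, n_p, n_r)
    return var_s / (var_s + error)


# Hypothetical variance components (NOT the Table 1 values).
components = dict(var_s=0.50, var_sp=0.30, var_sr=0.05, var_psr=0.25)

# One prompt and one rater versus four prompts and three raters.
print(round(g_coefficient(**components, n_p=1, n_r=1), 3))
print(round(g_coefficient(**components, n_p=4, n_r=3), 3))
```

Because each error term is divided by a different count, adding prompts and adding raters do not buy the same gain in reliability; which one helps more depends on the relative sizes of the components.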



The Optimization Procedure 

A branch-and-bound integer programming algorithm, which is a linear programming 
technique, was employed to estimate the optimal combination of raters and prompts. This 
investigation used the solver function of Microsoft EXCEL, version 5.0, to execute the algorithm. 
The variance components from Table 1, the cost function given in equation 1, and the number of 
prompts, raters, and subjects were entered into the EXCEL worksheet. 

The following optimization problem was submitted for analysis. 



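Objective Function: 

\[
\text{Minimize } L = \text{Total Cost} = \$1725\,n_p + \$1.43\,n_p n_r n_s + \$1.00\,n_s \qquad (5)
\]

Subject to: 

\[
E\rho^2 = \frac{\sigma^2_s}{\sigma^2_s + \sigma^2(\delta)} \geq 0.8, \qquad (6)
\]

\[
n_p \text{ and } n_r \text{ are integers}, \qquad (7)
\]

\[
\text{and } n_p \geq 1,\; n_r \geq 1. \qquad (8)
\]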



The objective function is to minimize the total cost. Constraint (6) specifies the minimal 
acceptable level of generalizability. Constraints (7) and (8) further delimit the search to a feasibility 
region of positive integers. 
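
The structure of this problem can be sketched with a plain exhaustive search over small integer designs. This is deliberately not the branch-and-bound procedure run in the EXCEL solver; enumeration is swapped in here because the search space is tiny. The function names, the upper bound `max_level`, and the variance components are all assumptions made for illustration (the components are not the Table 1 values).

```python
from itertools import product


def total_cost(n_p, n_r, n_s):
    """Total cost function from Equation 1, in dollars."""
    return 1725 * n_p + 1.43 * n_p * n_r * n_s + 1.00 * n_s


def g_coefficient(vc, n_p, n_r):
    """G-coefficient under the relative model for a crossed s x p x r design."""
    error = vc["sp"] / n_p + vc["sr"] / n_r + vc["psr"] / (n_p * n_r)
    return vc["s"] / (vc["s"] + error)


def optimal_design(vc, n_s, g_min=0.8, max_level=30):
    """Return (cost, n_p, n_r) for the cheapest feasible design, or None."""
    feasible = [
        (total_cost(n_p, n_r, n_s), n_p, n_r)
        for n_p, n_r in product(range(1, max_level + 1), repeat=2)
        if g_coefficient(vc, n_p, n_r) >= g_min
    ]
    return min(feasible) if feasible else None


# Hypothetical variance components (NOT the Table 1 values).
vc = {"s": 0.50, "sp": 0.30, "sr": 0.05, "psr": 0.25}

for n_s in (25, 1_000, 50_000):
    print(n_s, optimal_design(vc, n_s))
```

Running the search at several values of n_s shows how the cheapest feasible combination of prompts and raters can change with the number of subjects, even though the reliability constraint itself does not involve n_s.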

This analysis produces the optimal number of prompts and raters that will 
assure minimum cost with a g-coefficient at or above 0.8. This analysis was conducted for sample 
sizes ranging from 25 to 50,000. 



The Second Situation 

The second synthetic example combines the Koretz et al. (1994) variance components with 
the Hoover and Bray (1995) cost estimates. 

The Variance Model 

In the Koretz et al. (1994) variance model, two facets are used: part (p) and rater (r). The 
object of measurement is the student's overall writing ability (s). Coincidentally, then, the equations 
for the variance model, the relative error variance and the generalizability coefficient are identical 
here to those for the Parkes and Suen data. Therefore, the variance model is given in Equation 2; 
the relative error variance is given in Equation 3; and the generalizability coefficient is given in 
Equation 4. It is worth noting, however, that a g-coefficient was calculated for each of the five 
subscales (purpose, organization, details, voice, and mechanics). That is, each subscale has its 
particular variance components and g-coefficient, as is evident in Table 2. 

The Optimization Procedure 

As with the first situation, the variance components from Table 2, the cost function given 
in equation 1, and the number of parts, raters, and subjects were entered into the EXCEL 
worksheet. 

The following optimization problem was submitted for analysis. 

Objective Function: 

\[
\text{Minimize } L = \text{Total Cost} = \$1725\,n_p + \$1.43\,n_p n_r n_s + \$1.00\,n_s \qquad (9)
\]

Subject to: 

\[
E\rho^2 = \frac{\sigma^2_s}{\sigma^2_s + \sigma^2(\delta)} \geq 0.8 \text{ for each subscale}, \qquad (10)
\]

\[
n_p \text{ and } n_r \text{ are integers}, \qquad (11)
\]

\[
\text{and } n_p \geq 1,\; n_r \geq 1. \qquad (12)
\]

This analysis produces the optimal number of parts and raters that will 
assure minimum cost with a g-coefficient at or above 0.8 on every subscale. This analysis was also conducted for 
sample sizes ranging from 25 to 50,000. 



Results 

In addition to producing an optimal design for each sample size, corresponding 
development, scoring, material, and total costs were also derived. Tables 3 and 4 contain the 
optimal designs at selected sample sizes as well as the dollar figures and percentage of total cost 
attributable to each cost category. 

INSERT TABLES 3 AND 4 HERE 



For both synthetic examples, the same pattern of results emerges, as is evident in Figures 1 
and 2. There is not one optimal answer that holds for all sample sizes. For small samples, the 
optimal designs contain more ratings than prompts. In contrast, the optimal designs for the larger 
samples contain more prompts than ratings. 

INSERT FIGURES 1 AND 2 HERE 



Furthermore, as the sample size increases, the proportions of development, scoring, and 
material costs to total cost change. In each case, for smaller sample sizes, development costs 
represent by far the largest proportion of total cost. But as the samples get larger, scoring costs 
take on the lion's share of total cost. 
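
The shift in cost proportions follows directly from the total cost function (development $1725 per prompt, scoring $1.43 per prompt per rater per subject, materials $1.00 per subject). The sketch below holds a design fixed and varies only the sample size; the function name and the fixed design of two prompts and two raters are assumptions chosen for illustration, not one of the optimal designs in Tables 3 and 4.

```python
def cost_shares(n_p, n_r, n_s):
    """Return each cost category as a fraction of total cost."""
    parts = {
        "development": 1725 * n_p,
        "scoring": 1.43 * n_p * n_r * n_s,
        "materials": 1.00 * n_s,
    }
    total = sum(parts.values())
    return {name: value / total for name, value in parts.items()}


# A fixed, hypothetical design (2 prompts, 2 raters) at several sample sizes.
for n_s in (25, 1_000, 50_000):
    shares = cost_shares(n_p=2, n_r=2, n_s=n_s)
    print(n_s, {name: round(share, 3) for name, share in shares.items()})
```

Development cost does not depend on n_s, so its share can only fall as the sample grows, while the scoring term scales with n_s and comes to dominate.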

The first result mentioned above is influenced by the second result. Since the constrained 
optimization algorithm is tasked with minimizing cost while maintaining a g-coefficient of 0.8 or 
better, it will consider relatively lower cost parts in order to boost reliability before considering the 
higher cost items. That means that when development costs are relatively large, ratings are 
considered; when scoring costs are relatively large, prompts are considered. 
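
One way to see why the algorithm behaves this way is to compare the incremental cost of one more prompt with that of one more rater under the total cost function $C(n_p, n_r) = 1725\,n_p + 1.43\,n_p n_r n_s + 1.00\,n_s$. The worked comparison below is an added illustration, not part of the original analysis, and the design $n_p = n_r = 2$ is a hypothetical choice.

```latex
\[
\Delta_p = C(n_p + 1, n_r) - C(n_p, n_r) = 1725 + 1.43\,n_r n_s,
\qquad
\Delta_r = C(n_p, n_r + 1) - C(n_p, n_r) = 1.43\,n_p n_s .
\]

% With n_p = n_r = 2:
\[
n_s = 25:\quad \Delta_p = 1725 + 71.50 = 1796.50, \quad \Delta_r = 71.50;
\]
\[
n_s = 50{,}000:\quad \Delta_p = 1725 + 143{,}000 = 144{,}725, \quad \Delta_r = 143{,}000 .
\]
```

At the small sample size, an extra rater costs about 4% of what an extra prompt costs, so raters are the cheap way to buy reliability. At the large sample size the two increments are nearly equal, and the choice then turns on which facet reduces the relative error variance more.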



Discussion 

This investigation has been designed to shed light on the complexity of the relationship 
between cost and reliability. In order to do so, synthetic examples combining real data from 
different sources were used. There are some drawbacks to this approach. Many assumptions had 
to be made about the cost structures used. Since the variance components and the cost data came 
from different sources, there is no guarantee that the cost function calculated was the most 
appropriate to work with the data. In other words, it would be inappropriate to take these optimal 
designs back to Vermont and implement them. The ideal would be to have cost data from Vermont. 

This study is an improvement on mere speculation because it utilizes data from assessments 
that were actually conducted. It could be improved upon by having both cost and reliability data 
from the same source. This investigation does gain some generalizability, however, since the 
results held for two examples, one based on a large-scale assessment and one based on a small- 
scale assessment. 

The focus of this investigation, however, is the processes and relationships involved, not 
the actual numbers produced. In this regard, some of the results provide some food for thought. 
The results here seem to indicate that the size of the sample being assessed is an important, if not 
the most important, factor that determines optimal designs. 

Previous speculation that increasing tasks would increase cost appears simplistic in light of 
these results. By relying on generalizability alone without using any cost data, the picture was 
clearer: increase tasks. In the optimization framework, both generalizability concerns (such as size 
of the variance component) and cost concerns are considered. Using this approach, the picture is 
more complex: check the effect of sample size before turning to tasks. 

On the surface, adding complexity to this issue sounds counterproductive. At 
a deeper level, though, the previous simple view of the relationship between cost and reliability had 
led to an impasse. If increasing tasks was the only way to get a more reliable assessment, and 
doing so was going to make performance assessments even more costly than they already were, then 
performance assessments were between a rock and a hard place. The use of the constrained optimization 
procedure to simultaneously consider cost and reliability has provided a route through the impasse. 
It has shown, for example, that many designs can satisfy the psychometric constraints. This 
study has added another piece to this alternate route: when simultaneously considering cost and 
reliability, the sample size will affect the optimal design. In the present cases, the number of tasks 
is reduced as sample size increases, thus at least partially exonerating task as the culprit causing 
high cost. 